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Abstract 

The human vocal folds are known to interact with the vocal tract 
acoustics during voiced speech production; namely a nonlinear source- 
' filter coupling has been observed both by using models and in in vivo 

phonation. These phenomena are approached from two directions in 
this article. We first present a computational dynamical model of 
the speech apparatus that contains an explicit filter-source feedback 
mechanism from the vocal tract acoustics back to the vocal folds oscil- 
lations. The model was used to simulate vocal pitch glides where the 
trajectory was forced to cross the lowest vocal tract resonance, i.e., 
the lowest formant F±. Similar patterns produced by human partici- 
pants were then studied. Both the simulations and the experimental 
results reveal an effect when the glides cross the first formant (as may 
happen in [i]). Conversely, this effect is not observed if there is no 
formant within the glide range (as is the case in [a]). The experiments 
show smaller effect compared to the simulations, pointing to an active 
compensation mechanism. 

Keywords. Speech modelling, vocal folds model, flow induced vibrations, 
modal locking. 

PACS. Primary 4370Bk. Secondary 4370Gr, 4370Aj, 4370Mn. 
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1 Introduction 



It is an underlying assumption in the linear source-filter theory of vowel 
production that the sound source (i.e., the vocal fold vibration) operates 
independently of the filter (i.e., the vocal tract, henceforth VT) whose res- 
onances modulate the resulting vowel sound PQE]- This classical approach 
captures a wide range of phenomena in speech production, at least in male 
speakers. However, significant observations remain unexplained by the feed- 
back free source-filter model, such as some fine structure in the phonation of 
a female singer near the lowest formant F\. Instability of the fundamental 
(glottal) frequency f has been detected acoustically [3] and at tissue level 
[I] when a singer performs an / -glide over F\ on a steady vowel. As argued 
by Titze [5j, the observed frequency jumps are due to a feedback coupling 
from the vocal sound pressure back to the glottal pulse generation, denoted 
there as the nonlinear source-filter coupling. Since F\ usually lies well above 
/o in male phonation, this phenomenon occurs typically in female subjects 
when they are producing vowels with a low F\. In laboratory experiments, 
Titze et al. found more instabilities in male subjects, possibly because males 
have less experience in suppressing unwanted instabilities [3]. 

The vocal source instabilities have been modeled by low-order mass-spring 
systems. A two-mass vocal folds model, coupled with a resonator tube, 
showed coupling related effects when the dimensions of the tube were ma- 
nipulated [6]. Tokuda et al. simulated vocal pitch glides using a four- mass 
model to analyze the interactions between vocal register transitions and VT 
resonances [7]. 

For the current study, the fo-Fi cross-over phenomenon is approached 
from two complementary directions: 1) by simulating / -glides numerically 
on a steady vowel where Fi is near f , using a dynamical vowel model that 
includes the filter-source feedback loop described above; and 2) by carrying 
out a related vowel glide production experiment on female test subjects. 

2 The vocal folds model 
2.1 Anatomy 

Before introducing the glottis model, some anatomic details are reviewed, 
and physiological control mechanisms (actuated by muscles) affecting the 
speech characteristics are explained. Such explanation should be regarded 
as an idealization since the effect of a single muscle contraction can to some 
extent be compensated by other muscles involved. The glottis model is based 
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(a) 



(b) 



Figure 1: (a): Sketch of the anatomy of the glottis according to Gray, (b): 
The geometry of the glottis model and the symbols used. The trachea (i.e. 
the channel leading from the lungs to glottis) is to the left in this sketch and 
the vocal tract is to the right. 

on a further simplification of this idealization. 

The vowel sound is produced by the self-sustained cut-off effect of the 
air flow from lungs, caused by a quasi-periodic closure of an aperture - 
known as the rima glottidis — by two string-like vocal folds. This process 
is called phonation, and the system comprising the vocal folds and the rima 
glottidis is known as the glottis. A single period of the sound pressure signal, 
produced by the glottal flow, is called the glottal pulse. 

As shown in Fig. [T^i, each vocal fold consists of a vocal cord (also known as 
a vocal ligament) together with a medial part of the thyroarytenoid muscle, 
and the vocalis muscle (not specified in Fig.Hk). Both vocal folds are attached 
to the thyroid cartilage from their anterior ends and to two corresponding 
arytenoid cartilages (i.e., the left and the right) from their posterior ends; see 
Fig. [Th. In addition to these three cartilages, there is the ring-formed cricoid 
cartilage whose location is inferior to the thyroid cartilage. The vocal folds 
and the associated muscles are supported by these cartilages as explained 
next. 

Between each of the arytenoid cartilages and the cricoid cartilage, there 
are two muscles attached. These are the posterior and the lateral cricoary- 
tenoid muscles, and they have opposite mechanical actions. Indeed, during 
phonation the vocal folds are adducted by the contraction of the lateral 
cricoarytenoid muscles, and conversely, abducted by the posterior cricoary- 



3 



tenoid muscles during, e.g., breathing. This control action is realized by a 
rotational movement of the arytenoid cartilages in a transversal plane. In 
addition, there is a fifth (unpaired) muscle — the arytenoid muscle — whose 
contraction brings the arytenoid cartilages closer to each other, thus reducing 
the opening of the glottis independently of the lateral cricoarytenoid mus- 
cles. These rather complicated control mechanisms regulate, in particular, 
the type of phonation in the breathy-pressed scale. 

The fundamental frequency fo of a vowel sound is controlled by an- 
other mechanism that is actuated by two cricothyroid muscles (not visible in 
Fig. Wfr). The contraction of these muscles leads to a rotation of the thyroid 
cartilage with respect to the cricoid cartilage. Because of this rotation, the 
thyroid cartilage inclines to the anterior direction, thus extending the vocal 
folds. The elongation of the string-like vocal folds leads to increased stress 
which raises the fundamental frequency of their longitudinal vibrations. 

2.2 Glottis model 

The anatomic configuration in Fig. is modelled by a low-order mass-spring 
system shown in Fig. [Tb, based on earlier work of Aalto [HI HO]- Numerical 
efficiency and theoretical tractability favor a low-order model, and the aim 
of the model is at functionality rather than at anatomic detail. The model 
is able to reproduce accurately the measured male glottal pulse obtained 
by inverse filtering [HI E]. The model supports non-symmetric vocal fold 
vibrations, and both the fundamental frequency fo as well as the phonation 
type can be chosen by parameter values. Register shifts (e.g., from modal 
register to falsetto) are important phenomena not in the scope of this model. 
Accounting for register shifts would require either modelling the vocal folds 
as aerodynamically loaded strings or by a high-order mass-spring system that 
has a string-like "elastic" behavior. 

The vocal fold model in Fig. [lb consists of two wedge-shaped vibrating 
elements that have two degrees of freedom each. The distributed mass of 
these elements can be reduced into three mass points which are located so 
that rriji is at x = L, rrij 2 at x — 0, and at x — L/2. The elastic support 
of the vocal ligaments is approximated by two springs at points x = aL and 
x = bL. The equations of motion for the vocal folds are given by 

M x W x {t) + B x W x {f) + KxWxit) = -F(t), 
M 2 W 2 (t) + B 2 W 2 (t) + K 2 W 2 (t) = F(t), teR. 

where Wj = (wji,Wj 2 ) T are the displacements of the right and left endpoints 
of the j th fold, j = 1, 2. The respective mass, damping, and stiffness matrices 
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Mj, Bj, and Kj in are 
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and Kj = P 
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b j2 



a 2 kji + b 2 kj 2 ab(kji + kj 2 ) 
ab(kji + kj 2 ) b 2 kj 1 + a 2 kj 2 



(2) 



The entries of these matrices are computed by means of Lagrangian mechan- 
ics. The damping matrices Bj are diagonal since the dampers are located 
at the endpoints of the vocal folds. The springs are located symmetrically 
around the midpoint x = L/2, so that a = (L/2 + l)/L and b = (L/2 — l)/L. 
The control parameter P > is used for simulating variable frequency /o- 
glides for the purpose of this work. 

During the glottal open phase (when AWx(t) > 0), the load terms of 
([T]) are given by F = (-Fa,i> Fa,2) T as given in Eq. (JH]). During the glottal 
closed phase (when Alfi(it) < 0), there are no aerodynamic forces apart 
from the acoustic counter pressure from the VT, denoted by p c and properly 
introduced in the context of Eq. (TH). Instead, there is a nonlinear spring force 
for the elastic collision of the vocal folds, given by the Hertz impact model 



Hp-Hj/2 Hi ^ . 



Pc 



2L 



Pc 



(3) 



The model geometry is shown in Fig.[Tb, and it corresponds to the coronal 
section through the center of the vocal folds. As always in such biomechanical 
modelling [3 [121 US]; the lumped parameters of the mass-spring system are 
in some correspondence to the masses, material parameters, and geometric 
characteristics of the sound producing tissues. Such parameters are, e.g., the 
mass and stiffness matrices Mj and Ki, i — 1, 2, that appear in the equations 
of motion for the vocal folds ([Tj). 

More precisely, matrices Mj correspond to the vibrating masses of the vo- 
cal folds, including the vocal ligaments together with their covering mucous 
layers and (at least, partly) the supporting vocalis muscles. The elements of 
the matrices Kj are best understood as linear approximations of k(s) — f/s 
where / = f(s) is the contact force required for deflection s at the center of 
the string-like vocal ligament in Fig. [Th. It should be emphasized that the 
exact numerical correspondence of tissue parameters to lumped model param- 
eters Mj and Kj is intractable (and for practical purposes, even irrelevant), 
and their values in computer simulations must be tuned using measurement 
data of fo and the measured form of the glottal pulse [9]. Even though the 
equations of motion are separate for both vocal folds, the parameters (hence, 
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the simulated vocal folds movements) are symmetric in all of the simulations 
reported in this paper. 



3 Full model of vowel production 

Having treated the modelling of the vocal folds, it remains to review the 
other components of the full model, i.e., the ID incompressible flow model 
and the VT acoustics model. 

These two subsystems are coupled so that the flow depends on the time- 
dependent glottal opening. Conversely, the flow produces aerodynamic forces 
on the vocal folds, and it also acts as the acoustic source to the resonating 
VT, modelled by Webster's equation. The acoustics of the sub-glottal air 
cavities is not modelled at all. The VT sound (counter) pressure gives rise to 
the filter-source feedback from VT to the glottal oscillations. Without this 
feedback, modal locking does not appear at all as can be verified by running 
model simulations with /o ~ F\ with p c = in fl6]). 



3.1 Flow 



An incompressible ID flow through the glottal opening with velocity v a is 
described by 



V a {t) 



Psub 



a, 



Mt) 



(4) 



where the latter term inside the parentheses (representing the viscous pres- 
sure loss) is motivated by the Hagen-Poiseuille law in a narrow aperture. 
The constant sub-glottal pressure (subtracted by the ambient air pressure) 
is denoted by p su b, and h is the width of the flow channel that is assumed 
rectangular. The parameter Ci ner regulates the flow inertia, and C g regulates 
the pressure loss in the glottis. Aalto et al. observed that Ci ner effectively 
reveals the phonation type when other model parameters are estimated based 
on recorded speech signals 0. 

The viscous pressure loss in PJ depends on the glottal opening at the 
narrowest point, i.e., AW\. At the other end (i.e., towards the trachea) the 
opening is AW 2 - These are given by (TjQ) through 



AW 2 



W 2 -W 1 + 
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(5) 



The parameter g is the glottal opening when there is no flow and the vocal 
folds do not vibrate (W\ = W 2 = 0). In human anatomy, the parameter g is 
related to the position and orientation of the arytenoid cartilages. 
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In the glottis, the flow velocity V(x, t) is assumed to satisfy the mass con- 
servation law H(x,t)V(x,t) = H]V a (t) for incompressible flow where H(x,t) 
is the height of the flow channel inside the glottis. In the model geometry of 
Fig. [lb, we have 

H(x,t) = AW 2 (t) + y(AW 1 (t)-AW 2 (t)), xe [0,L]. 

Now the pressure p(x, t) in the glottis is given by the two equations above 
and the Bernoulli law p(x, t) + \pV{x, t) 2 = p sub for static flow. 

Since each vocal fold has two degrees of freedom, this pressure and the 
VT counter pressure p c can be reduced to a force pair (Fa,i, Fa,2) T where 
Fa,i affects at the narrow (superior) end of the glottis (x = L) and Fa,2 at 
the wide (resp. inferior) end (x — 0). This reduction is carried out by using 
the total force and moment balance equations 

F A ,i + F A ,2 = hi {p(x, t) - p sub ) dx 
Jo 

and 

L ■ F A ,i = h J x(p(x,t) -p sub ) dx-p c - h H ^ H ° - ■ 

The moment is evaluated with respect to point (x, y) = (0, 0) for the lower 
fold and (x, y) = (0, H ) for the upper fold in Fig. [Tb. Evaluation of these 
integrals yields 

- A,l 2^ u o lb ^y £AVi(AW2-AWi) (AW1—AW2) 2 yAWi J ) 4L '^ c ' 

F An l -mi 2 hT ( H * 1 /Wa^ ■ Hi(H -Hi/2) , 

?A,2 - iPVo' 1 ^ \&Wi{AWi-AWx) {AW 1 -AW 2 ) 2 m \AW! J J ^ 4L 



(6) 



3.2 Vocal tract 



The vocal tract acoustics is modeled by Webster's lossless horn resonator. 
The governing equation for the velocity potential ty(s,t) is 

where c is the sound velocity. The parameter s G [0, L VT ] is the distance 
from the narrow (superior) end of the glottis measured along the VT center 
line, and Lyr is the length of the VT. The area function A(-) is the cross- 
sectional area of the VT, perpendicular to the VT center line. The sound 
pressure is given in terms of the velocity potential by p = p^ t - 
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At the glottis end, the resonator is controlled by the flow velocity v a from 
Eq. through the boundary condition \I/ S (0, i) = —v (t). The resonator 
exerts a counter pressure p c (t) = p\E' t (0,t) to the vocal folds equations ([I]) 
through Eqs. thereby forming a filter-source feedback loop. The bound- 
ary condition at lips is a frequency-independent acoustic resistance of the 
form ty t (LvTj t) + 9c^/ s (L VT , t) = where 9 is the normalized acoustic resis- 
tance [13] . 

3.3 Model parameters 

The glottal flow equations (BJ contain two independent parameters p su b/C g 
and Ciner] the mass-spring system contains three parameter matrices Mj, Bj, 
and Kj for j = 1, 2; and the resonator equations contain the area function 
A(-) in (J2J) and the acoustic termination parameter 9 at the mouth. 

Out of these parameters, p su b/C g and 9 are determined from physical 
considerations and A(-) from anatomical data obtained by MRI. The area 
function used in simulations corresponds to [0:] as in Hannukainen et al 
|14j . The parameter range for Cj ner has then been determined and validated 
so as to produce a realistic time-domain glottal pulse form [9j; the values 
corresponding to pressed phonation are used in this work. All these model 
parameter values introduced so far are equally valid for both female and male 
phonation. 

It remains to consider the parameter matrices in the vocal folds equations 
(TO where the differences between female and male phonation are significant. 
Horacek et al. provide parameter values in male phonation [T5J [12] . The 
parameter values (Ki, K-i) have been estimated indirectly by requiring the 
correct fundamental frequency /o [Ej. The focus of this paper is in female 
phonation, and it is difficult to produce similar data for female subjects using 
literature. Thus, the typical nominal "male versions" of parameters Mj and 
Kj for j — 1, 2 (as given in Aalto [H [9]) are scaled so as to obtain typical 
"female version" as explained above. This scaling is based on the data given 
by Titze [16J. 

Dimensional analysis and scaling yield 

^female = A a M male and -^female = A° ^male 

where the exponent a G [2, 3] by physical grounds. Based on experimenta- 
tion, the value used here is a = 2.3. The scale factor A ~ 0.6 is suggested 
by Titze [16] but here the value A = 0.55 produces more female like phona- 
tion in simulations. The resulting increase in the resonances of fll]) by factor 
1/A ~ 1.8 reflects correctly the higher pitch of the female voice. 
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The /o-glide is simulated by additional scaling of the matrices Kj whereas 
the matrices Mj are kept constant. This is based on the assumption that the 
vibrating mass of vocal folds is not significantly reduced when the speaker's 
pitch increases; a reasonable assumption as far as register changes are ex- 
cluded. The authors would like to remark that the relative magnitudes of 
Mj and Kj essentially determine the resonance frequencies of model ([T]). 
However, attention must be paid to their absolute magnitudes using, e.g., 
dimensional analysis since otherwise the load terms ±F(t) in (JTJ) (contain- 
ing the aerodynamic forces, contact force between the vocal folds during the 
glottal closed phase, and the counter pressure from the VT) would scale in 
an unrealistic manner. 

The damping parameters bji, i,j = 1,2, in Eq. play an important 
but problematic role in glottis models. If there is too much damping (while 
keeping all other model parameters fixed), sustained oscillations do not oc- 
cur. Conversely, too low damping will cause instability in simulated vocal 
folds oscillations. The magnitude of physically realistic damping in vibrating 
tissues is not available, and the present model could possibly fail to give a 
quasi- stationary glottis signal even if realistic experimental damping values 
were available for use. For simplicity, we set bji = (3 > for i, j = 1, 2, and 
the value of glottis loss is adjusted separately for each parameter value 
set in order to obtain stable but sustained oscillation. In particular, param- 
eter (3 is adjusted every 100 ms during the / -gbde simulations presented 
below. The damping remains always so small that its lowering effect on the 
resonances of the mass-spring system ([1]) is negligible. 

Let us conclude by discussing the parameter magnitudes in Eq. ([1]). The 
total vibrating mass is mji +mj2+mjs = 0.48 g for male and 0.12 g for female 
phonation. The total spring coefficients are kji + kj 2 = 193 N/m for male 
and 161.3 N/m for female phonation using the nominal values, i.e., when 
P = 1 in Eq. (T5]). The nominal values yield fo = 110 Hz for male and 187 
Hz for female phonation. If the characteristic thickness of the vocal folds is 
assumed to be about 5 mm, these parameters yield a magnitude estimate for 
the elastic modulus of the vocal folds by E « -^jt^ ■ 5 ■ 10~ 3 m « 6.6 kPa. 
This is in good comparison with Fig. 7 in Chhetri et al. [T7] where estimates 
are given for the elastic modulus of ex vivo male vocal folds between 2.0 kPa 
and 7.5 kPa for different parts of the vocal folds. 

3.4 Numerical realization 

The model equations are solved numerically using MATLAB software and 
custom-made code. The vocal fold equations of motion ([1]) are solved by the 
fourth order Runge-Kutta time discretization scheme. The discontinuity of 
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the load F(t) at AWi(t) = is dealt with by an interpolation procedure 
detailed in Aalto (Section 2.4) [8J. The flow equation (jl]) is solved by the 
backward Euler method. The VT is discretized by the FEM using piecewise 
linear elements (N = 100) and the physical energy norm of Webster's equa- 
tion. Crank-Nicolson time discretization is used, and the time step is always 
20 fis. 

4 Simulation results 

Frequency glides of vowel [0:] are simulated near the lowest formant Fx or its 
subharmonic Fi/2. In these simulations, Fx (determined from spectrograms 
of simulated vowel signals) coincides with the lowest resonance of the VT 
(solved independently from the eigenvalue problem associated to (JTj) and the 
boundary conditions). To produce the glides, parameter P in Eq. ([2]) is in- 
creased (or decreased) as a quadratic function of time. Then the oscillation 
frequency of the vocal fold model ([T]) in the absence of the counter pressure 
p c — denoted by f — increases (respectively, decreases) as a linear func- 
tion of time. This can be observed in simulation results if the filter-source 
feedback mechanism is disconnected by setting the counter pressure p c = 
in Eq. When using p c (t) = pty t (0,i) with Eq. (J7|), the observed funda- 
mental frequency fo of the whole system does not behave linearly in time (in 
contrast to / ) but exhibits jumps near Fx (and Fx/2) as shown in Fig. [3b. 
This behavior, the modal locking, is due to the filter-source feedback 0] . 

Changing the area function A(-) in Eq. <^ to correspond some other 
vowel than [0:] does not change the results of model simulations apart from 
formant, and hence, modal locking frequencies. 

4.1 fo—F\ crossover 

Frequency /o-glides of length 2 s are simulated by varying the parameter P 
of the glottis model so that 350 Hz < f < 810 Hz linearly in time. The 
increasing phase is shown in the spectrogram Fig. [3Ji with an auxiliary line 
showing the glide of fo and another line showing F\ = 647 Hz. It is observed 
that fo coincides first with / , but then it suddenly jumps upwards to Fx 
when it reaches about 470 Hz. The wave form of the glottal pulse near the 
transition is a superposition of two signals with frequencies fo and Fx- When 
fo exceeds Fx, then / and fo coincide again. The qualitative behavior is 
sketched in the rising part of Fig. [3b. 

In a downwards glide the behavior is almost symmetric. First fo = fo 
descends down to Fx where fo locks for a while. The latent glottal model 
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frequency / , of course, goes down linearly without change. After a time lag, 
fo is released from F\ and drops suddenly to fo- This behavior is shown in 
the descending part of Fig. [3b. 

As can be seen in Fig. [3b, there are two periods (of length T u ,Td > 0) 
during which f = F\. Let us denote the corresponding frequency jumps by 
> 0. By Fig. [3b we see 4> U /T U = 4>d/T d is determined by the ascend 
and descend rate of the simulated, symmetric / -glide. We observe <p d > (fi u 
consistently in all simulations, and if the /o-ghde is very slow, we observe that 
4>d ~ 4>u and Fig. [3b becomes symmetric. Simulating extremely slow glides 
(during which the parameter P can be regarded as a constant), it is observed 
that the jumps <pd arid (f) u have a common lower bound (f) ~ 174 Hz. This 
number can be regarded as the "true" magnitude of the simulated frequency 
jump, and it is in reasonable correspondence with the results reported in 

HUE]. 

It was noted above that <pd> 4>u- One way to understand this asymmetry 
is in terms of the energy dissipation from VT resonance modes. The only 
energy losses in Eq. (0) are due to the dissipative boundary condition at lips. 
During a downwards glide, there is a locking of fo at Fi, and then a lot 
of energy is contained in the corresponding eigenmode of the joint system, 
comprising the VT air column and the vocal fold masses. For the vocal 
folds oscillations to get "unlocked" from this eigenmode, most of this energy 
must first be dissipated either by mouth radiation or by dispersion to other 
eigenmodes as a consequence of nonlinearity and lack of time invariance of 
the model. The dissipation rate may be slow compared to the speed of 
decrease of the /o-glide. Thus, the latent frequency Jo m ay have fallen far 
below F± before unlocking of fo may take place. It is therefore expected that 
the time delay T<j is sensitive to relative magnitudes of losses and amounts of 
energy stored in vocal folds and VT, and this is supported by the simulations. 
Indeed, Td becomes very large if the vocal folds oscillation amplitude is small 
during the glide. 

4.2 Glides near subharmonics of F\ 

Similar / -glides are produced as explained above but now 187 Hz < f < 
390 Hz linearly during 2 s time interval. The subharmonic F±/2 — 324 Hz lies 
in this interval, and two kinds of behavior are observed. First, fo increases 
from below in Fig. [2^ until F\j2 is reached; there is no locking at Fi/2 but 
fo jumps one octave up and locks at F\. Second, the phenomenon does not 
appear at all if the vocal fold oscillation amplitude is large enough. 

In Fig. [2b there is a downward glide near F\j2. First and third harmonics 
of fo have a static part matching F\ and 2F\ respectively. It may be possible 
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Figure 2: (a): Upwards /o-glide over Fi/2. (b): Downwards /o-glide over 
Fx/2. 

that f is "partly locked" to Fx/2 but there is a clear component that develops 
with fo- Such a complicated spectral behavior emphasizes the surprising 
dynamical richness of the vowel production model presented in this paper. 

5 Glide production experiment 

The simulation results and earlier experimental evidence [3] suggest that 
consequences of filter-source feedback should be detectable at crossings of 
frequencies fo an d Fx. Unfortunately, determining the precise crossing time 
of fo and Fx in vowel glides is difficult when based on the acoustic signal 
alone. Indeed, the harmonics 2/ , 3/o, ... do not coincide with the bandwidth 
of F\ if f ss F%. If fo follows successfully a slowly varying reference glide (as 
should happen in these experiments), it is then far from a persistently exciting 
signal that would be required for precise detection of F\. In a similar glide 
production experiment, Titze et al. [3] asked the test subjects to produce 
a spectrally rich vocal fry adjacent to every glide. A good estimate for F\ 
can be obtained this way, but the formant may creep during the glide. The 
subject is, in fact, expected to use various techniques to avoid modal locking 
in order to produce audibly clean vowel glides, and F\ position is expected 
to be affected near f . We conclude that obtaining the crossing time of fo 
and Fx from acoustic signals remains a difficult estimation problem. 

Rather than replicating the experimental arrangements of Titze et al. [3] , 
a complementary approach is taken here without Fx as a controlled variable. 
This is achieved by eliciting vocal glides for two different vowels [a] and [i], 
where [i] has Fx within the fo glide range, and F\ of [a] is as far as possible 
from the same range. The model simulations predict that the /crt ra j Tories 
corresponding to [a] and [i] should have distinctly different characteristics 
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Figure 3: (a): / -glide 350 Hz - 810 Hz. (b): A sketch of the modal locking 
in an / -glide over Fi first upwards and then downwards. The thick (thin) 
line shows / (resp. / ) during the glide. 

that should be detectable in natural glide productions. 

5.1 Quantification of the jumping pattern 

During a modal locking episode, f is expected to jump as in Fig. [3J To 
quantify the jump effect for statistical analysis, the Duration Ratio {DR) of 
the glide fo = fo(t) is defined. For frequencies Ja < fa < h < /b, DR is 
defined by 

DR(fo) = (8) 

where (tA,ts) is the longest open interval where Ja < fo(t) < fs for all 
t G (tA, is), and {t a ,tf,) is the longest open interval where f a < fo(t) < fb for 
all t G (t a ,tb) such that \t a ,tb] C (tA, i.e., tA < t a < < This implies 
< DR(fo) < 1 for any glide /o- By definition, Di? does not depend on 
the direction or the speed of the glide in the sense that DR(f ) = DR(f ) = 
DR(fS) where f {t) = f {-t) and f$(t) = f {at) for a > 0. 

In the current experiments, all octave glides are between 200 Hz and 400 
Hz, and the parameter values for Eq. ([8]) are always J'a = 7 • 200 Hz 230 
Hz, /„ = 1 f A ~ 264 Hz, f b = 1 f a w 303 Hz, and / fl = 7 / 6 w 348 Hz where 
7 = 2 1 / 5 w 1.148698. Hence, the octave [200 Hz, 400 Hz] is divided into five 
intervals equal in logarithmic scale. We say that the glide / is full if its 
range contains all of the interval [/a, /#]• If fo(t) = a2 kt for some a > 0, 
k 7^ 0, and t G [0, T] is a full glide, then we call it logarithmically linear and 
we have DR(fo) = 1/3 which can be regarded as the nominal value of DR 
on full glides in the absence of any perturbations. 

For an ideal logarithmically linear full glide fo with a single modal locking 
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"jump" as in Fig. [31 we have DR(f ) < 1/3 if the jump intersects /& but not 
fa or f B . Similarly, DR(f ) > 1/3 if the jump intersects f a but not f b or 
/a- The jump size and the jump position F (satisfying F = F\ in glide 
simulations) change the DR in a practically convenient way: to see this, 
a numerical experiment was performed. Full, logarithmically linear jump 
patterns were constructed using jump size of 100 Hz and one million tokens 
of F drawn from a normal distribution with /i = 300 Hz and a = 50 Hz. 
This produced E[DR] = 0.308 and SD[DR] = 0.325. Keeping a constant 
but varying jx gave the following values: E[Di£] = 0.354 and SD[DR] = 0.270 
at ll = 250 Hz; E[DR) = 0.240 and SD[DR] = 0.270 at fi = 350 Hz; and 
finally E[DR) = 0.333 and SD[DR] = 0.002 at ll = 600 Hz as expected. 

The experiment is designed so as to create situations where glides of [i] 
are more likely to have lower DR compared to glides of [a], due to modal 
locking induced /o-jumps over /& for [i] but not at all for [a]. More precisely, 
the position of F\ of [i] is expected to be roughly at 300 Hz (which, of course, 
explains the choice of /& above), and the position of F\ of [a] is likely to be 
above 600 Hz. Thus, simulated model predictions of jump patterns can be 
formulated as the statistically testable hypotheses: 

Hypothesis 1. The population mean of the Duration Ratio on full [i] -glides 



fl i] and on full [a]-glides f [ a] satisfies E DR(f^ J ) < E DR(ft 
Hypothesis 2. The population mean of the Duration Ratio on full [i] -glides 



c[oh 



/q satisfies E DR(fl ) 11 ) < 1/3, and the population mean of the Duration 



Ratio on full [a] -glides /q satisfies E DRiyf^ 1 ) = 1/3. 

5.2 Subjects and the recording arrangement 

Eleven native Finnish speaking female students at the University of Helsinki 
were recruited for the experiment. Females were chosen as subjects because 
(non-singing) males have typically little familiarity of using their vocal range 
around their naturally occurring F 1 . Hence, using males would probably lead 
to an overly high rejection rate of glide samples and excursions to falsetto 
register in otherwise successful glides. 

None of the participants reported any hearing or voice problems, and 
they had not received professional training in singing although some of them 
had a musical hobby (like violin playing). The participants were informed 
of the general purpose of the research and presented with the gliding task. 
There was a short familiarization period before the actual experiment during 
which the participants could practice the gliding with the vowel [a]. Some 
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participants were helped to find the right pitch range. The data was recorded 
in a sound-proof studio with a high-quality microphone (AKG C4000B), and 
digitized with Digidesign (DIGI 002) and ProTools v.9. 

5.3 Glide imitation stimuli 

The subjects were instructed to produce frequency glides by following a spec- 
trally neutral, synthesized reference glides from 200 Hz to 400 Hz. Every 
reference glide started and ended with a constant frequency part of 0.25 s 
to create a sensation of the right initial (resp., final) pitch of the glide. The 
instruction was given as a pre-recorded 0.5 s sample of either of the vowels, 
followed by three beeps (countdown beep: 0.25 s signal followed by 0.75 s 
silence). The beeps had the pitch of the onset value of the desired glide 
(either 200 Hz or 400 Hz). The glides and beeps were frequency modulated 
triangle waves; the triangle wave function s(t) is a 27r-periodic function with 
constant slope (except for multiplicities of 2n). The upward glides are given 
by s(2nu(t)) with u(t) = I ^^(2 t/T - 1) for t E [0,T] where T is the dura- 
tion of the glide (either 1.5 s or 3.0 s). The instantaneous frequency of such 
a glide is given by ^ = 2*/ T • 200Hz which was assumed to be equal with 
the perceived pitch. 

5.4 Experimental setup 

The subjects were asked to imitate the pitch of reference glides by producing 
them with a prolonged, spoken [a] or [i]. The subjects heard the reference 
glide from earphones at a volume that was adjusted for each subject to be as 
loud as possible without being unpleasant. The purpose of this arrangement 
was to hinder the subjects from hearing their own production easily and to 
elicit good phonation. 

The data was gathered following a factorial 2 3 design with direction (ei- 
ther fall or rise), duration (either fast 1.5 s or slow 3 s), and vowel (either 
[a] or [i]) as factors. Each subject had her own pseudo-randomized list of 
stimuli. The lists were constructed so that the different stimulus types would 
be randomly and evenly distributed in the trial: the stimuli were divided in 
blocks of 24 where each stimulus condition occurred three times, three iden- 
tical stimuli were never in a row (not even over the blocks), four identical 
glide directions, vowels, or glide speeds were never in a row. 
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Figure 4: A typical production of a vowel glide [i]. The linear line depicts 
the guiding signal and the zig-zagging line shows the fundamental frequency 
of the glide. The intervals corresponding to the mid-frequency and total 
durations are shown as horizontal lines. 

5.5 Data analysis 

Three subjects were disqualified since they did not learn the gliding pattern 
well enough. Most of their productions had at least one of the following 
characteristics: the glide was one octave away from the guiding glide signal, 
the glide pattern was missing (static or undulating pitch), or the range of 
the glide was limited to less than a quarter. The remaining eight subjects 
had no difficulties in the gliding task. 

A total of 864 glides were processed further. The vocal pitch was ex- 
tracted by cross-correlation using Praat (Praat 5.2.21; default parameters 
with admissible range from 200 Hz to 400 Hz). The pitch trajectories of the 
glides were automatically chopped from the audio signal, based on the tim- 
ing of the guiding signals and further analyzed in M ATLAB. A custom-made 
algorithm was used to calculate DR(-) from the data. 

The glides were analyzed for their "fullness" to exclude the related bias in 
Duration Ratio values. The range from to Jb was divided in ten equally 
long frequency bands in logarithmic scale. If some of the bands did not 
contain a value of fo, the glide was not considered full. This led to exclusion 
of 8 glides (1%). A mixed linear model was fitted to the remaining data 
using statistical software package R (version 2.14.0) and the model parameter 
selection was based on log-likelihoods [18J. Both of the Hypotheses [T] and [2] 
were independently tested by t-tests. 

A typical production is shown in Fig. H] where the fo trajectory of a 
fast falling [i] is given. The reference glide is the diagonal line with flat 
ends at t — 0, 2. The four horizontal lines indicate the critical frequencies 
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fAifaifb an d fs as introduced above. The corresponding time intervals for 
Eq. (JED are given by (t A ,t B ) = (0.52, 1.59) and [t a ,t b ] = (0.77, 1.03), yielding 
DR(f ) = 0.243 < 1/3. 

5.6 Results 

The estimated means and standard deviations for DR at each eight factor 
combination groups are shown in the Table [TJ The estimated mean DR's of 
[a] are consistently larger than those of [i] (see Hypothesis [1]) , and also the 
nominal DR value 1/3 = 0.33 ... is above the estimated mean of DR for all 
four [i] conditions. 

Table 1: The means and standard deviations for the Duration Ratio (DR) 
over each factor combination. 



Condition 


Mean 


SD 


N 


Fast fall [a] 


0.341 


0.099 


106 


Fast fall [i] 


0.324 


0.108 


105 


Fast rise [a] 


0.331 


0.084 


108 


Fast rise [i] 


0.312 


0.073 


108 


Slow fall [a] 


0.337 


0.106 


107 


Slow fall [i] 


0.326 


0.104 


106 


Slow rise [a] 


0.340 


0.102 


108 


Slow rise [i] 


0.320 


0.105 


108 



Considering the four vowel pair samples separately, Hypothesis [T] on the 
population mean DR holds at p < 0.05 by fast rises only (Welch two sample 
(one-sided) t-test with df = 209, t = —1.8, p = 0.04). The estimated 
mean and standard deviation of DR using all [i] glides are 0.320 and 0.098 
(n = 427), respectively. For all [a] glides, the respective values are 0.337 
and 0.098 (n = 429). Hence, Hypothesis [1] is verified using the full data set 
(Welch two sample test with df = 853, t = -2.5, p < 0.01). 

The first part of Hypothesis |2] is verified at p = 0.05 by fast rise [i] (one- 
sided t-test, df = 107, t = —3.1, p = 0.001) and also by using all [i] glides 
as the sample (one-sided t-test with df = 426, t = —2.7, p < 0.01). The 
population mean of DR over all [a] is not significantly different from the 
nominal value (df = 428, \t\ < 0.9, p < 0.4) but the probability of a Type 
II Error is still large. Thus, the latter part of Hypothesis |2] is not supported 
statistically conclusively by the current dataset. 

A mixed effects model supports the previous observations. Glide j = 
1, . . . , n,i from subject i produces the DR value denoted by yij, given by the 



17 



Table 2: The linear mixed- effects model results for Duration Ratio (DR). 
The fixed factors are vowel, direction and speed; the subjects are treated as 
random factors. 



Factor 


Estimate 


Std. Error 


t- value 


p- value 


Intercept 


0.338 


0.009 


39.0 





Vowel ([i]) 


-0.017 


0.007 


-2.5 


0.01 


Direction (rise) 


-0.007 


0.007 


-0.98 


0.3 


Speed (slow) 


0.004 


0.007 


0.56 


0.6 



mixed effects model 

Vi,j = A) + dijpi + s itj (3 2 + Vi d (3 3 + k + e itj 

where the fixed factors (with numerical levels or 1) are the glide direction 
dij, the glide speed sy, and the vowel type v it j. The test subject related 
random factors 6j ~ iV(//j,(7j) are independent from the experimental errors 
ejj ~ N(0,o~). As can be seen in the Table 2, the vowel type has a sig- 
nificant effect (ANOVA, df = 845, |f| > 2.5, p = 0.011) while the effect of 
the direction and speed are not significant (df = 845, \t\ < 1). In a more 
complicated model taking into account all the interactions of the factors, 
the log-likelihoods of the models were lower, and the interactions were never 
significant. 

6 Discussion 

As is usual in model validation experiments, the measurement data is much 
more ambiguous than the results of numerical simulations. The reconciliation 
of simulated and measured data is difficult because of the following two main 
challenges: 

1. The computational model could give an unrealistically strong indication 
of modal locking because of unmodelled physics (such as subglottal 
acoustics, supraglottal turbulence, and energy losses). 

2. The humans have active control mechanisms (without counterparts in 
the computational model) that are expected to reduce the observable 
effects of modal locking. 

Subglottal acoustics has been completely excluded from the computational 
model. The main reason for this is the lack of experimental data that would 
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be required to reproduce the extremely difficult absorbing boundary condi- 
tion that the progressively subdividing system of bronchi and the alveoles 
constitute. Moreover, in the four-mass glottis model of Tokuda et al. [7], 
subglottal acoustics play a minor role compared to the supraglottal acoustics. 
Hence, the coupling of the subglottal acoustics and the vocal fold oscillations 
does not seem to be the primary source of the discrepancy between simula- 
tions and experimental results. 

The supraglottal aerodynamics might have influence on the vowel glides. 
State-of-the-art Computational Fluid Dynamics (CFD) models indicate sig- 
nificant vorticity above the glottis [19]. The vorticity-induced aerodynamic 
force to the vocal folds is a genuinely fluid mechanical feature that is beyond 
the scope of the acoustic theory of speech as well as simple Bernoulli flow 
models in VT. Evaluating the contribution of this force would also require a 
more refined CFD model than is usually available. 

In low frequency glide productions, the larynx moves vertically [20] . How- 
ever, the simulated vowel glides were produced using a fixed area function, 
and no movement in the larynx area is taken into account at all. The in- 
creased pitch is a consequence of the elongation and higher stress in vocal 
folds due to rotation of the thyroid cartilage in the anterior direction. To- 
wards the end of a glide production, larynx moves as a consequence of both 
the contracting thorax and the control actions required for variable pitch. 
These two mechanisms either oppose or reinforce each other depending on 
the direction of the glide. All this applies equally to VT geometries of [a] 
and [i]. 

Considering the latter challenge [2] above, the direct modeling of the au- 
ditory, motor, and muscular mechanisms behind the task performance seems 
unfeasible at the moment. Fortunately, the experiments can be arranged in 
a manner that downplays or even interferes with the active control mech- 
anisms. For this reason, trained singers are - somewhat paradoxically - 
excluded as test subjects even though producing vowel glides resembles very 
much a singing exercise. (Recall that non-singing males were excluded for the 
opposite reason.) The drawback of having non-singers was that some of the 
subjects could not perform the glide task satisfactorily which may have led 
to selection bias. However, these subjects are expected to be less familiar in 
their vocal pitch control which would lead to less accurate compensation (and 
hence, better detection in experiments) of the modal locking disturbance. 

Control mechanisms based on auditory feedback have been shown to 
rapidly and accurately compensate deviations from target glides in glide pro- 
duction [2TJ. This could lead to asymmetric f trajectories in terms of the 
glide direction because of different delays in control and observation. Recall- 
ing the simulated fo trajectory in Fig. [3b, such a modal locking event during 
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a rising glide is detectable only right after a large jump has already occurred 
while a modal-locking event during a fall can, in principle, be detected when 
the fo has just ceased to decline. Considering the control delay alone, sup- 
pose that the compensation strategy against modal locking were only based 
on the auditory observation with an identical delay (of say 80 ms) for both 
rising and falling glides. Then the fall would be less perturbed by the modal 
locking event because then the control action has more time to react. For 
obvious reasons, however, the auditive observation of modal locking events 
in falling glides could be more delayed compared to rising glides for fo tra- 
jectory profiles as in Fig. [3b. All in all, there are convincing reasons why 
rising and falling glides in experiments are not expected to be symmetric, 
and such asymmetry could explain the fact that the fast rising [i] stands out 
when verifying Hypothesis [2] above. It must be noted that the fast rising [i] 
does not similarly stand out using the mixed effects model. 

Not much can be concluded from the observed numerical values of DR 
on the two vowel classes. First, the distribution of Fx for [i] glides was not 
controlled while it is known by the numerical experiment that a relatively 
small variation of 50 Hz in F affects the mean values of DR. Second, DR 
is a nonlinear functional, and one has to be careful in interpretations by 
continuity; e.g., DR may have very large or very small values for large jumps 
that cross both f^ and Third, the jumps never occur instantaneously in 
the data but rather show an accelerated pitch movement giving DR values 
closer to 1/3 than in the simulations. Fourth, the dynamics of the pitch 
following exercise is not understood in such detail that its effect on DR 
could be evaluated at all. Such dynamics may be a source of systematic 
error in the estimated DR values, resulting in the lack of verification of the 
latter part of Hypothesis [2] concerning [a] glides. 

The computational model predicts modal locking to subharmonics of the 
vowel formants as well. These are characterized by jumps of one octave, and, 
if they took place, they would have gone unnoticed in the data processing. 
Indeed, the pitch detection algorithm admitted only results inside one octave 
[200 Hz, 400 Hz] which was the nominal gliding range. Otherwise, large jumps 
were not observed for [a] in the current dataset but such jumps have been 
observed in other experiments [3]. 

7 Conclusions 

Results from numerical simulations and experiments have been reported on 
vowel glides. The simulations show a robust and large effect of modal lock- 
ing of fo to F\ whenever the frequencies coincide. The experimental data 
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shows a reduced effect of the same kind, supporting the computational model. 
Hence, experimental results give positive evidence for the existence of signifi- 
cant filter-source feedback in human glide productions even in an experimen- 
tal setting where the frequency jumps were implicitly asked to be avoided. 
Moreover, the results suggest that compensatory mechanisms are employed 
to avoid filter- source feedback induced f jumps near the formant frequency 

Uncontrolled fo jumps occur frequently in the phonation of boys during 
the puberty [22] . Changes in the vocal folds and the growing larynx make it 
difficult for them to produce natural intonation contours or to sing musical 
melodies. These observations may be partly due to modal locking since the 
relevant compensatory mechanisms are expected to require re-tuning after 
a radical change in the geometric dimensions of both the VT and the vocal 
folds. If the poorly compensated filter-source interaction were a significant 
cause, then these /o jumps would have to be more frequent in VT configura- 
tions where Fx is low; i.e., in high vowels such as [i] and [u]. 

Finally, we remark that the active compensation mechanisms to coun- 
teract the filter-source feedback could perhaps be observed directly. It is 
expected that especially trained singers would adjust their VT in order to 
keep the formant and the vocal pitch apart from each other. 
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